Abstract - Query-driven biclustering refers to the problem of
extracting biclusters based on a query gene or query condition.
The extracted biclusters consist of a set of genes and a subset
of conditions that are similar to the query gene or query
condition, and they include the query input itself. Two
approaches are applied to biclustering problems, top-down and
bottom-up, which differ in how they tackle the problem.
Top-down techniques [3, 4] start with the entire gene
expression matrix and iteratively partition it into smaller
sub-matrices. The bottom-up approach, on the other hand, starts
with a randomly chosen set of biclusters that are iteratively
modified, usually enlarged, until no local improvement is
possible. In this paper, the biological significance of biclusters
extracted using two query-driven models, viz. SIMBIC and
SIMBIC+, is compared. This paper is organized as follows.
Section 2 analyzes the popular MSB algorithm, section 3
introduces an improved version of MSB, namely the SIMBIC model,
and the enhanced model of SIMBIC, namely SIMBIC+, is
presented in section 4. The experimental analysis and the
biological significance are illustrated in section 5.

Index  Terms  -  Data  Mining,  Gene  Expression  Data, 
Biclustering,  Average  Correlation  Value,  Biological 
Significance,  Gene  Ontology 

I.  Introduction 

Existing  biclustering  models  for  microarray  data  analysis 
often  do  not  answer  the  specific  questions  of  interest  to  a 
biologist.  This  lack  of  sharpness  has  prevented  them  from 
surpassing  a  rather  vague  exploratory  role.  Often,  biologists 
have  at  hand  a  specific  gene  or  set  of  genes  (seed  genes) 
which  they  know  or  expect  to  be  related  to  some  common 
biological pathway or function. This problem formulation
gives rise to queries such as 'Which genes involved in a
specific protein complex are co-expressed?'

Thus the biclustering problem is mathematically defined
as follows. Let $E(G, C)$ be the expression matrix of size
$m \times n$. Let $g$ be the query gene of interest. The biclustering
problem is to find $G' \subseteq G$ and $C' \subseteq C$ such that the bicluster
$B = E(G', C')$ forms a significant pattern [5].

II.  MSB  Algorithm 

Liu X. and Wang L. [4] developed a query-driven
biclustering model, namely Maximum Similarity Bicluster
(MSB), to find an optimal bicluster with the maximum similarity
score. A query gene or set of genes is given as input. The
model constructs a similarity matrix S(G, C) based on the similarity
between the query gene and the remaining genes. A
gene or condition with low similarity is eliminated in each
cycle until the similarity matrix reduces to a single element.
Then the bicluster with maximum similarity is extracted. This
algorithm can extract constant and additive biclusters.
Biclusters of MSB were extracted using BicAT-plus to
compare different biclustering methods based on biological
merits.
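
The elimination cycle described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation from [4]: the function names, the toy similarity (a tolerance `tol` minus the absolute difference from the query gene) and the bicluster score (the smaller of the minimum row sum and the minimum column sum) are all assumed here for illustration. The published MSB similarity depends on additional parameters that are not reproduced.

```python
import numpy as np

def toy_similarity(E, query, tol=1.0):
    # Assumed stand-in for the MSB similarity: positive when an entry lies
    # within `tol` of the query gene's value in the same condition.
    return tol - np.abs(E - E[query])

def greedy_single_deletion(S, query):
    """Delete one gene or condition per cycle until S shrinks to 1 x 1."""
    rows, cols = list(range(S.shape[0])), list(range(S.shape[1]))
    best = (-np.inf, rows[:], cols[:])
    deletions = 0
    while True:
        sub = S[np.ix_(rows, cols)]
        row_sums, col_sums = sub.sum(axis=1), sub.sum(axis=0)
        # Assumed bicluster score: the weakest row or column sum.
        score = min(row_sums.min(), col_sums.min())
        if score > best[0]:
            best = (score, rows[:], cols[:])
        if len(rows) == 1 and len(cols) == 1:
            break
        # The query gene is never a deletion candidate.
        cand = [(row_sums[k], 0, r) for k, r in enumerate(rows) if r != query]
        if len(cols) > 1:
            cand += [(col_sums[k], 1, c) for k, c in enumerate(cols)]
        _, axis, idx = min(cand)  # node with the lowest similarity sum
        (rows if axis == 0 else cols).remove(idx)
        deletions += 1
    return best, deletions
```

For an m x n matrix this sketch always performs m + n - 2 deletions before it reaches a single element, and it returns the best-scoring intermediate submatrix seen along the way.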

III.  SIMBIC  Model

The  MSB  algorithm  is  improved  by  introducing  t-test 
based  gene  selection,  contribution  to  the  entropy  based 
condition  selection  and  multiple  node  deletion  techniques. 
Similarity  score  between  genes  and  similarity  score  for  a 
bicluster  are  defined  as  in  MSB  [4]. 

Multiple  genes  or  multiple  conditions  with  very  low 
similarity  are  removed  in  every  cycle  until  the  similarity  matrix 
reduces  to  a  single  element.  Then  a  bicluster  with  maximum 
similarity  is  extracted.  The  comparison  of  MSB  and  SIMBIC 
biclustering  models  is  tabulated  in  Table  I. 

Table I. Comparison of MSB with SIMBIC Model

| No. | MSB | SIMBIC |
|---|---|---|
| (i) | Any gene is considered as a query gene. | Functionally important genes are considered as query genes. |
| (ii) | Any condition is considered as a reference condition. | The ⌈n/2⌉ conditions that have more contribution to the entropy are considered as reference conditions. |
| (iii) | Number of iterations is m+n-2. | Number of iterations is comparatively less than m+n-2. |
| (iv) | Single node deletion method is used. | Multiple node deletion method is used. |
| (v) | The time required to find one bicluster is approximately 58 seconds for yeast data. | The time required to find a single bicluster of the same size is approximately 2.5 seconds for yeast data. |
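
To illustrate why multiple node deletion needs fewer cycles than m+n-2, the following sketch removes every weak gene in one pass and then every weak condition. It is a sketch under assumptions, not the SIMBIC procedure itself: the `threshold` parameter, the order of the two passes, the fallback to a single deletion and the score (minimum row or column sum) are hypothetical choices made here.

```python
import numpy as np

def multiple_node_deletion(S, query, threshold=0.0):
    """Remove all weak genes, then all weak conditions, in each cycle."""
    rows, cols = list(range(S.shape[0])), list(range(S.shape[1]))
    best = (-np.inf, rows[:], cols[:])
    cycles = 0
    while True:
        sub = S[np.ix_(rows, cols)]
        row_sums, col_sums = sub.sum(axis=1), sub.sum(axis=0)
        score = min(row_sums.min(), col_sums.min())  # assumed score
        if score > best[0]:
            best = (score, rows[:], cols[:])
        if len(rows) == 1 and len(cols) == 1:
            break
        # Pass 1: drop every non-query gene whose row sum is below threshold.
        bad_rows = {r for k, r in enumerate(rows)
                    if r != query and row_sums[k] < threshold}
        kept_rows = [r for r in rows if r not in bad_rows]
        # Pass 2: recompute column sums on the surviving genes.
        new_col_sums = S[np.ix_(kept_rows, cols)].sum(axis=0)
        bad_cols = {c for k, c in enumerate(cols) if new_col_sums[k] < threshold}
        if len(bad_cols) == len(cols):  # always keep at least one condition
            bad_cols.discard(cols[int(np.argmax(new_col_sums))])
        kept_cols = [c for c in cols if c not in bad_cols]
        if not bad_rows and not bad_cols:
            # Nothing fell below the threshold: delete the single worst node.
            cand = [(row_sums[k], 0, r) for k, r in enumerate(rows) if r != query]
            if len(cols) > 1:
                cand += [(col_sums[k], 1, c) for k, c in enumerate(cols)]
            _, axis, idx = min(cand)
            (kept_rows if axis == 0 else kept_cols).remove(idx)
        rows, cols = kept_rows, kept_cols
        cycles += 1
    return best, cycles
```

On a 4 x 4 toy matrix with one dissimilar gene and one dissimilar condition, this sketch removes both in the first cycle and finishes in fewer than m+n-2 cycles.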

IV.  SIMBIC+  Model

The SIMBIC+ biclustering model is an enhancement of the
SIMBIC model in which the similarity between two genes is
defined based on the ratio between the genes [2]. This
ratio-based similarity measure extracts scaling pattern biclusters,
whereas SIMBIC extracts constant and additive
biclusters. The comparison of SIMBIC and SIMBIC+
biclustering models is depicted in Table II.

Table II. Comparison of SIMBIC+ Model with MSB

| No. | MSB | SIMBIC+ |
|---|---|---|
| (i) | Single node deletion method is used. | Multiple node deletion method is used. |
| (ii) | Dissimilarity measure depends on the absolute difference between the reference gene and any gene. | Dissimilarity measure depends on the ratio between the reference gene and any gene. |
| (iii) | Similarity measure depends on the parameters α and β. | No such parameters are used for bicluster identification. |
| (iv) | More complex. | Complexity and number of iterations are reduced. |
| (v) | Biclusters have less biological significance. | Biclusters have more biological significance. |
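
The difference between the two dissimilarity measures in row (ii) can be seen on a small example. The sketch below is illustrative only: the helper names and the use of the max-min spread as a measure of how constant a profile is are assumptions made here, not definitions taken from [2] or [4].

```python
import numpy as np

def difference_profile(gene, ref):
    # Absolute difference from the reference gene, condition by condition.
    return np.abs(gene - ref)

def ratio_profile(gene, ref):
    # Ratio to the reference gene (assumes no zero reference values).
    return gene / ref

def spread(profile):
    # Zero spread means the gene follows the reference pattern exactly.
    return float(profile.max() - profile.min())

ref = np.array([1.0, 2.0, 4.0, 8.0])
scaled = 3.0 * ref    # scaling pattern
shifted = ref + 3.0   # additive pattern

print(spread(ratio_profile(scaled, ref)), spread(difference_profile(scaled, ref)))
print(spread(ratio_profile(shifted, ref)), spread(difference_profile(shifted, ref)))
```

A gene that is a constant multiple of the reference has a flat ratio profile but a varying difference profile, so only a ratio-based measure groups it with the reference. The opposite holds for an additively shifted gene.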

V.  Experimental  Analysis 

In this section, the performance of the SIMBIC and SIMBIC+
bicluster models is evaluated. Microarray data has a
large number of features (genes), the majority of which are
not relevant to the description of the problem, and these can
degrade the gene expression analysis by masking
the contribution of the relevant features. Hence, after applying
t-test based gene selection and contribution to the entropy
based condition selection [6], experiments were conducted
on the benchmark Yeast Saccharomyces cerevisiae gene
expression dataset. The yeast cell cycle expression dataset
is a time series dataset that contains 2,884 genes and 17
conditions. Analysis of this dataset also helps in antifungal
drug discovery.
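
The condition selection step mentioned above is not spelled out in this paper. One common formulation of "contribution to the entropy" uses the entropy of the normalized squared singular values and scores each feature by how much the entropy drops when that feature is left out. The sketch below follows that formulation as an assumption; the function names and the default of keeping half of the conditions (rounded up) are hypothetical.

```python
import numpy as np

def svd_entropy(A):
    """Entropy of the normalized squared singular values, scaled to [0, 1]."""
    s = np.linalg.svd(A, compute_uv=False)
    if len(s) < 2:
        return 0.0
    v = s ** 2 / np.sum(s ** 2)
    v = v[v > 0]  # skip exact zeros to avoid log(0)
    return float(-(v * np.log(v)).sum() / np.log(len(s)))

def entropy_contributions(A):
    """Leave-one-out contribution of each condition (column) to the entropy."""
    full = svd_entropy(A)
    return np.array([full - svd_entropy(np.delete(A, j, axis=1))
                     for j in range(A.shape[1])])

def select_conditions(A, k=None):
    """Indices of the k conditions with the largest contribution."""
    if k is None:
        k = int(np.ceil(A.shape[1] / 2))  # assumed default: half, rounded up
    ce = entropy_contributions(A)
    return sorted(np.argsort(ce)[::-1][:k].tolist())
```

With the 17 conditions of the yeast dataset, the assumed default would keep 9 conditions.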

A.  Bicluster  Evaluation  Measures 

Two types of measures, namely quantitative measures and
qualitative measures, are used to evaluate biclusters. The
quantitative measures quantify a bicluster in terms
of its size or volume. The qualitative measures
determine the quality of extracted biclusters in terms
of statistical measures, namely variance, Mean Squared Residue
(MSR) and Average Correlation Value (ACV) [7].
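
As a hedged sketch of the two qualitative measures named above, the code below computes MSR in its usual form (the mean of the squared residues a_ij - a_iJ - a_Ij + a_IJ) and ACV as the larger of the average absolute pairwise row correlation and the average absolute pairwise column correlation. The function names are chosen here, and the formulas are the standard definitions rather than ones quoted from [7]. The sketch assumes no row or column of the bicluster is constant, since correlation is undefined in that case.

```python
import numpy as np

def mean_squared_residue(B):
    row_mean = B.mean(axis=1, keepdims=True)
    col_mean = B.mean(axis=0, keepdims=True)
    residue = B - row_mean - col_mean + B.mean()
    return float((residue ** 2).mean())

def average_correlation_value(B):
    def avg_abs_corr(M):
        # Average absolute correlation over all distinct pairs of rows of M.
        k = M.shape[0]
        R = np.abs(np.corrcoef(M))
        return (R.sum() - k) / (k * k - k)
    return float(max(avg_abs_corr(B), avg_abs_corr(B.T)))

additive = np.array([[1.0, 2.0, 4.0], [2.0, 3.0, 5.0], [4.0, 5.0, 7.0]])
scaling = np.array([[1.0, 2.0, 4.0], [2.0, 4.0, 8.0], [3.0, 6.0, 12.0]])
print(mean_squared_residue(additive), average_correlation_value(additive))
print(mean_squared_residue(scaling), average_correlation_value(scaling))
```

An additive bicluster has an MSR of zero, while a scaling bicluster has a positive MSR even though its rows are perfectly correlated. ACV is close to 1 in both cases, which is why it suits the scaling patterns targeted by a ratio-based model.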

Statistical measures evaluate a bicluster theoretically, but
the biological significance establishes the real quality of the
extracted bicluster. The Gene Ontology (GO) tool renders the
biological significance in terms of the function of the genes in
the bicluster. To determine the statistical significance of the
association of a particular GO term with a group of genes in
the list, the GO tool estimates the p-value. Though a number of
tools are available to find the Gene Ontology, the GOTermFinder
tool has been used in this paper to discover the biological
significance of the genes in the bicluster. Apart from the p-value,
Gene Ontology includes the Biological Process (BP), Molecular
Function (MF) and Cellular Component (CC) of the genes in
the bicluster. Biological process refers to a biological objective
to which the gene or gene product contributes. Molecular
function is defined as the biochemical activity of a gene
product. Cellular component refers to the place in the cell
where a gene product is active. A lower p-value indicates that
the genes in the extracted bicluster are biologically
significant. The significance of the p-value is measured at the 0.1,
0.05 and 0.01 levels.

The SIMBIC algorithm extracts the maximum
similarity bicluster corresponding to the query gene. It is
possible to extract both constant and additive biclusters using
this algorithm. It is observed that the number of iterations
needed to extract the maximum similarity bicluster is reduced
considerably [1]. For the query gene 288 (YBR198C)
and the default parameter setting α = 0.4, β = 0.5 and γ =
1.2, MSB extracts a constant bicluster of size 225 (15 x 15).
The genes in the bicluster are YAL041W, YBR123C, YBR198C,
YDL045C, YGR083C, YHL023C, YHR201C, YJR129C,
YLL043W, YLR215C, YLR317W, YLR324W, YLR425W,
YMR176W, YPL002C and the conditions are 2, 3, 4, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15, 16 and 17. The Parallel Coordinate (PC)
plot of the constant bicluster of MSB using BicAT is provided in
Fig. 1.
Fig. 1. PC plot of constant bicluster with query gene YBR198C using BicAT

Fig. 2. PC plot of additive bicluster with query gene YBR198C using BicAT

For the same reference gene YBR198C and reference
condition 14, the size of the additive bicluster using BicAT is 77
(11 x 7). The PC plot of this additive bicluster is provided in
Fig. 2. The genes in the bicluster are YBL014C, YBR198C,
YCL054W, YLR298C, YML115C, YMR033W, YMR308C,
YOR201C, YPL002C, YPL009C, YPR129W and the conditions
are 8, 9, 11, 13, 14, 15 and 17.
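
The p-value that a GO enrichment tool reports for a term is commonly a hypergeometric tail probability: the chance of seeing at least k annotated genes in a list of n genes drawn from a background of N genes, K of which carry the annotation. The sketch below shows that calculation without any multiple-testing correction. The function name and the example counts are hypothetical and are not taken from the results of this paper.

```python
from math import comb

def enrichment_p_value(N, K, n, k):
    """P(X >= k) when n genes are drawn from N, of which K are annotated."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Hypothetical example: a background of 2884 genes, 20 of them annotated
# with some term, and a bicluster of 11 genes containing 2 annotated genes.
p = enrichment_p_value(2884, 20, 11, 2)
print(p, p < 0.01)
```

The resulting value can then be compared against the 0.1, 0.05 and 0.01 significance levels used in this section.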

B.  Comparison  of  Biological  Significance 

The  biological  significance  of  constant  bicluster  for  the 
reference  gene  YBR198C,  identified  by  Liu  X.  et  al.  [4]  is 
tabulated  in  Table  III.  The  biological  process  of  the  additive 
bicluster  extracted  using  BicAT  for  the  same  reference  gene 
is given in Table IV. The molecular function and cellular
component of the same additive bicluster are presented in
Table V and Table VI respectively.

Table III. Biological Significance of Constant Bicluster of MSB

| No. | Biological Significance |
|---|---|
| 1. | 2 out of 15 input genes are directly annotated to root term 'biological process unknown': YOL042W, YER034W |
| 2. | 2 out of 15 input genes are directly annotated to root term 'molecular function unknown': YEROfi EW, YER034W |
| 3. | No significant ontology term can be found for input genes |

Thus  it  is  observed  from  Table  III  that  for  the  constant 
bicluster  there  is  no  biological  significance.  Since  SIMBIC 
and  MSB  extract  identical  biclusters  they  have  identical  Gene 
Ontology.  It  is  evident  from  Table  IV  that  the  additive  bicluster 
has  9  significant  ontologies  related  to  the  biological  process. 
It  is  observed  from  Table  V  that  the  extracted  additive  bicluster 
has  6  significant  molecular  functions.  Also  Table  VI  shows 
that  there  is  only  one  significant  gene  ontology  related  to 
cellular  component.  Thus,  it  is  evident  that  additive  biclusters 
and  scaling  pattern  biclusters  have  more  biological 
significance  than  constant  biclusters. 

Comparison  of  GO  enrichment  of  biclusters  of  Yeast 
dataset  obtained  using  SIMBIC+  and  MSB  is  tabulated  in 
Table  VII.  Since  ratio  based  similarity  is  used  in  SIMBIC+,  it 
is  possible  to  extract  scaling  pattern  biclusters.  It  is  observed 
from  Table  VII  that  the  biclusters  extracted  using  SIMBIC+ 
have  more  biological  significance  than  the  biclusters  of  MSB. 

Table  IV.  Biological  Process  of  Additive  Bicluster 
Table V. Molecular Function of Additive Bicluster

| No. | Molecular function | Genes |
|---|---|---|
| 1 | rRNA (guanine) methyltransferase activity | YCL054W, YOR201C |
| 2 | rRNA methyltransferase activity | YCL054W, YOR201C |
| 3 | RNA methyltransferase activity | YCL054W, YOR201C |
| 4 | general RNA polymerase II transcription factor activity | YBR198C, YMR033W |
| 5 | S-adenosylmethionine-dependent methyltransferase activity | YCL054W, YOR201C |
| 6 | methyltransferase activity | YCL054W, YOR201C |

Table VI. Cellular Component of Additive Bicluster

| No. | Cellular component | Genes |
|---|---|---|
| 1 | macromolecular complex | YBL014C, YBR198C, YCL054W, YLR298C, YML115C, YMR033W, YPL002C, YPL009C |

Table VII. Comparison of GO Enrichment of Biclusters of Yeast Dataset Obtained by SIMBIC+ and MSB

Conclusion

In real-life situations, there is a need to find sets of genes
that are correlated with the given query rather than merely similar
to it. The ratio-based similarity measure defined in
SIMBIC+ is efficient in extracting highly correlated biclusters,
which helps to identify genes with more biological
significance. Thus the SIMBIC+ query-driven biclustering model
outperforms the MSB and SIMBIC models in terms of biological
significance.